Neural Machine Translation - French to English

by Pritish Jadhav, Mrunal Jadhav - Sun, 12 Jul 2020
Tags: #python #Keras #deep learning #machine translation

$\huge Neural\ Machine\ Translation$

Neural Machine Translation is the use of deep neural networks to translate text from one language (the source language) into another (the target language).

In this notebook, we will use sequence to sequence modelling to translate French to English.

BIRD'S EYE VIEW OF THE BLOG

  • Formulate the problem - Sequence to Sequence Approach
  • Getting Acquainted with the Dataset
  • Preprocessing and Preparing Text Data
  • Define and Train the Model
  • Inference
  • Model Evaluation

1. Sequence to Sequence Modelling

An RNN is a neural network that uses internal memory (hidden states) to process a sequence of inputs. The RNN performs the same computation on every element of the input sequence, and its output is fed back into the network. Recurrent Neural Networks thus use reasoning from previous time steps to inform the upcoming ones.

Let $T_x$ be the length of the input and $T_y$ be the length of the output. A variety of RNNs based on input and output length are :

  1. One to One (Fig 1) : $T_x\ =\ T_y\ =\ 1$
  2. Many to One (Fig 2) : $T_x\ >\ 1\ ;\ T_y\ =\ 1$
  3. One to Many (Fig 3) : $T_x\ =\ 1\ ;\ T_y\ >\ 1$
  4. Many to Many (Fig 4.1) : $T_x\ >\ 1\ ;\ T_y\ >\ 1$

Consider our task of French to English translation.
French Sentence : "je pense que vous vous etes tous rencontres"
English Translation : "i think that youve all met"

The architecture in Fig 4.1 can be applied when the input and output lengths are the same ($T_x$ = $T_y$). Here, however, the French sentence consists of 8 words whereas its English counterpart consists of 6 words. To model such situations we use sequence to sequence models instead of the plain many-to-many architecture: Seq2Seq is a many-to-many RNN architecture for the case $T_x \neq T_y$.

  • Encoder -- The Encoder consists of RNN units that encode the input sequence into a fixed-size intermediate representation.
  • Intermediate Representation -- The final hidden state produced by the encoder LSTM. It encapsulates the information in the input sentence and helps the decoder make predictions. It is fed as the initial hidden state to the first cell of the decoder LSTM.
  • Decoder -- The Decoder consists of RNN units. Each unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
In [1]:
# import important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import os
import string
from numpy import array
from pickle import dump
from unicodedata import normalize
import unicodedata
import unittest
import logging

import zipfile
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from nltk.tokenize import sent_tokenize, word_tokenize

from PIL import Image

import pickle
from tqdm import tqdm

# for preparing text data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping

# model-layers
from keras.layers import Input, LSTM, Dense, Embedding
from keras.models import Model

# plotting the defined model architecture
from keras.utils import plot_model

# word-cloud libraries
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

%load_ext nb_black

from IPython.display import HTML

display(
    HTML("<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>")
)
# display(HTML("<style>.container { width:100% !important; }</style>"))
Using TensorFlow backend.
In [2]:
# declare constants needed for training the deep learning model.

# set the following parameter to True to limit the data used throughout the notebook
n_samples = False
# number of samples
NUM_SAMPLES = 20000

LATENT_DIM = 256
EPOCHS = 25
BATCH_SIZE = 25

# Initialise unittest instance object
tc = unittest.TestCase()

# Wordcloud paramters
MAX_WORDS = 600
MAX_FONT_SIZE = 120
version_name = "version1"

# initialize logger
logger = logging.getLogger("machine_translation")
logger.setLevel(logging.DEBUG)

# create console handler and set level to debug
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)

# create formatter
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

# add formatter to ch
ch.setFormatter(formatter)

# add ch to logger
logger.addHandler(ch)

2. Read and Explore Dataset

To learn sequence to sequence modelling for Neural Machine Translation, we use a dataset of French phrases along with their English counterparts.

The dataset can be downloaded from: French-English Translations
OR
You can just run the cell below :)

The Tab-delimited Bilingual Sentence Pairs page lists various language pairs, with sentences drawn from the Tatoeba Project.
Each dataset is a tab-separated file with one pair per line: English + TAB + The Other Language + TAB + Attribution

Feel free to explore and experiment!

In [3]:
def download_data(
    version_name="version1", destination_dir=os.path.join(".", "data", "fra-eng")
):

    data_path = os.path.join(destination_dir, "fra.txt")
    if not os.path.exists(data_path):
        logger.debug(f"Creating directory tree - {destination_dir}")
        Path(destination_dir).mkdir(parents=True, exist_ok=True)

        !wget http://www.manythings.org/anki/fra-eng.zip

        filename = os.path.join(destination_dir, "fra-eng.zip")

        os.rename("fra-eng.zip", filename)

        with zipfile.ZipFile(filename, "r") as zip:
            logger.debug(f"Extracting the downloaded zip file - {filename} now...")
            zip.extractall(destination_dir)
            logger.debug(f"Done ! Downloaded {filename} at {destination_dir}")
    else:
        logger.debug(f"Data Available at the right path- {destination_dir}.")

    # create directory to save model weights

    if not os.path.exists(version_name):
        Path(version_name).mkdir(parents=True, exist_ok=True)

    return data_path


file_path = download_data()
2020-07-13 02:47:32,214 - machine_translation - DEBUG - Creating directory tree - ./data/fra-eng
--2020-07-13 02:47:32--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 104.24.108.196, 172.67.173.198, 104.24.109.196, ...
Connecting to www.manythings.org (www.manythings.org)|104.24.108.196|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6041598 (5.8M) [application/zip]
Saving to: ‘fra-eng.zip’

fra-eng.zip         100%[===================>]   5.76M  6.08MB/s    in 0.9s    

2020-07-13 02:47:33 (6.08 MB/s) - ‘fra-eng.zip’ saved [6041598/6041598]

2020-07-13 02:47:33,843 - machine_translation - DEBUG - Extracting the downloaded zip file - ./data/fra-eng/fra-eng.zip now...
2020-07-13 02:47:34,026 - machine_translation - DEBUG - Done ! Downloaded ./data/fra-eng/fra-eng.zip at ./data/fra-eng
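Note: the `!wget` call above is a Jupyter shell magic and will not work outside a notebook. In that case a plain-Python fallback along these lines should do the same job (a sketch; `download_data_plain` is a hypothetical helper using the same URL and destination directory as above).
In [ ]:
# plain-Python fallback for environments without the wget shell magic
import os
import urllib.request
import zipfile
from pathlib import Path


def download_data_plain(destination_dir=os.path.join(".", "data", "fra-eng")):
    Path(destination_dir).mkdir(parents=True, exist_ok=True)
    zip_path = os.path.join(destination_dir, "fra-eng.zip")
    if not os.path.exists(zip_path):
        # download the archive directly instead of shelling out to wget
        urllib.request.urlretrieve("http://www.manythings.org/anki/fra-eng.zip", zip_path)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(destination_dir)
    return os.path.join(destination_dir, "fra.txt")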
In [4]:
raw_texts = pd.read_csv(
    file_path,
    encoding="utf-8",
    sep="\t",
    header=None,
    names=["target", "input", "comments"],
    usecols=["target", "input"],
)
display(raw_texts.head())
logger.debug(f"The shape of the data loaded : {raw_texts.shape}")
target input
0 Go. Va !
1 Hi. Salut !
2 Hi. Salut.
3 Run! Cours !
4 Run! Courez !
2020-07-13 02:47:38,105 - machine_translation - DEBUG - The shape of the data loaded : (177210, 2)
In [7]:
##visualising the French data using word cloud

cloud_text_input = str(list(raw_texts["input"]))
cloud_mask_input = np.array(Image.open("images/color1.jpg"))
cloud_title_input = "Most common words in French"

wordcloud = WordCloud(
    background_color="white",
    max_words=MAX_WORDS,
    max_font_size=MAX_FONT_SIZE,
    random_state=42,
    mask=cloud_mask_input,
)
wordcloud.generate(cloud_text_input)
plt.figure(figsize=(10, 10))
image_colors = ImageColorGenerator(cloud_mask_input)
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
plt.title(cloud_title_input, fontdict={"size": 20, "verticalalignment": "bottom"})
plt.axis("off")
plt.tight_layout()

##visualising the English data using word cloud

cloud_text_input = str(list(raw_texts["target"]))
cloud_mask_input = np.array(Image.open("images/color2.jpg"))
cloud_title_input = "Most common words in English"

wordcloud = WordCloud(
    background_color="white",
    max_words=MAX_WORDS,
    max_font_size=MAX_FONT_SIZE,
    random_state=42,
    mask=cloud_mask_input,
)
wordcloud.generate(cloud_text_input)
plt.figure(figsize=(10, 10))
image_colors = ImageColorGenerator(cloud_mask_input)
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")

plt.title(cloud_title_input, fontdict={"size": 20, "verticalalignment": "bottom"})
plt.axis("off")
plt.tight_layout()

3. Text Data Preprocessing and Preparation

Data preprocessing is an essential step in building a machine learning model. Applied to text data, these steps also significantly reduce the dimensionality of the vocabulary.
The preprocessing steps performed on each piece of text are:

  1. Removing punctuation.
  2. Lowercasing - the raw texts contain uppercase and lowercase letters, so without this step the model might represent "hello" and "hEllo" with two different word vectors.
  3. Converting special (accented) characters in the French texts to standard printable characters.
  4. Removing extra spaces.
In [5]:
def preprocess_texts(raw_text):
    table = str.maketrans("", "", string.punctuation)
    re_print = re.compile("[^%s]" % re.escape(string.printable))
    clean_texts = []
    for data_line in raw_text:

        clean_text = (
            unicodedata.normalize("NFD", data_line)
            .encode("ascii", "ignore")
            .decode("utf8")
        )

        # split the sentence into words
        clean_text = clean_text.split(" ")

        # change the words to lower
        clean_text = [word.lower() for word in clean_text]

        # remove punctuations
        clean_text = [word.translate(table) for word in clean_text]

        # remove non-printables
        clean_text = [re_print.sub("", w) for w in clean_text]

        clean_texts.append(" ".join(clean_text))

    return clean_texts
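A quick sanity check of the preprocessing on the example sentence from Section 1 (the stripped "!" leaves a harmless trailing space):
In [ ]:
print(preprocess_texts(["Je pense que vous vous êtes tous rencontrés !"]))
# -> ['je pense que vous vous etes tous rencontres ']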
In [6]:
preprocessed_data_path = os.path.join(
    ".", "data", "fra-eng", version_name, "preprocessed_fra-eng.csv"
)
if not os.path.exists(preprocessed_data_path):

    preprocessed_data_dir = "/".join(preprocessed_data_path.split("/")[:-1])
    Path(preprocessed_data_dir).mkdir(parents=True, exist_ok=True)

    data_texts = raw_texts[["input", "target"]].apply(preprocess_texts, axis=0)

    logger.debug(f"Saving Preprocessed Data to {preprocessed_data_path}")
    data_texts.to_csv(preprocessed_data_path, index=False)

    if n_samples:
        data_texts = data_texts[:NUM_SAMPLES]
        tc.assertEqual(len(data_texts), NUM_SAMPLES)

else:
    data_texts = pd.read_csv(preprocessed_data_path)
    if n_samples:
        data_texts = data_texts[:NUM_SAMPLES]
        tc.assertEqual(len(data_texts), NUM_SAMPLES)

    logger.info(f"Read Preprocessed Data Successfully!")
2020-07-13 02:48:00,705 - machine_translation - DEBUG - Saving Preprocessed Data to ./data/fra-eng/version1/preprocessed_fra-eng.csv

Teacher Forcing

Tag the start and end of each target sentence with <sos> and <eos> tokens respectively, for both training and inference.

Sequence prediction models often use the output from the previous time step, y(t-1), as input for the model at the current time step, x(t). However, during the early stages of learning the hidden states of the model are updated by a sequence of wrong predictions; feeding these back as input causes errors to accumulate and makes it difficult for the model to learn.

Teacher Forcing remedies this by using the ground-truth output from the prior time step (available in the training data) to compute the system state at the current time step: instead of feeding back its own, possibly wrong, prediction, each unit receives the correct "teacher" token as input for the next step. Teacher forcing is a procedure for training RNNs with output-to-hidden recurrence and emerges naturally from the maximum likelihood criterion; as a result, the model converges faster.

In [7]:
# manipulate data to support Teacher Forcing.

data_texts["decoder_input"] = "<sos> " + data_texts["target"]
data_texts["decoder_target"] = data_texts["target"] + " <eos>"

display(data_texts.head())
  input   target  decoder_input  decoder_target
0 va      go      <sos> go       go <eos>
1 salut   hi      <sos> hi       hi <eos>
2 salut   hi      <sos> hi       hi <eos>
3 cours   run     <sos> run      run <eos>
4 courez  run     <sos> run      run <eos>

Split into training, validation and testing sets

In [19]:
data_train_texts, data_test = train_test_split(data_texts, test_size=0.2)
data_test_texts, data_valid_texts = train_test_split(data_test, test_size=0.5)
print(f"The Data Train split has {data_train_texts.shape[0]} records")
print(f"The Data Valid split has {data_valid_texts.shape[0]} records")
print(f"The Data Test split has {data_test_texts.shape[0]} records")
The Data Train split has 141768 records
The Data Valid split has 17721 records
The Data Test split has 17721 records

Create the dictionary for both source and target languages.

Now we have to feed these source sentences and their corresponding target sentences to our Encoder-Decoder model. We cannot work with raw text directly, so we need to map the sentences to sequences of numbers.

To do this, an internal vocabulary (a dictionary of words) must be constructed from the list of texts, based on word frequency. Each sentence can then be converted to a sequence of integers using this dictionary: every word in the text is replaced with its corresponding integer from the word-to-index mapping.

To accomplish this we use the API that Keras provides for fitting a vocabulary on multiple documents: the Tokenizer.

  • fit_on_texts - Fits list of texts on the Tokenizer object and updates internal vocabulary based on a list of texts. Lower integer means more frequent word. 0 is reserved for padding.
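As a tiny illustration of how the Tokenizer builds its vocabulary (toy sentences, not the real corpus):
In [ ]:
from keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer(filters="")
toy_tokenizer.fit_on_texts(["va te coucher", "va voir"])

# 'va' is the most frequent word, so it gets index 1; ties are ordered by first appearance
print(toy_tokenizer.word_index)                        # {'va': 1, 'te': 2, 'coucher': 3, 'voir': 4}
print(toy_tokenizer.texts_to_sequences(["va voir"]))   # [[1, 4]]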
In [12]:
encoder_tokenizer_pickle_path = os.path.join(
    ".", "data", "fra-eng", version_name, "french_tokenizer.pkl"
)
decoder_tokenizer_pickle_path = os.path.join(
    ".", "data", "fra-eng", version_name, "english_tokenizer.pkl"
)

# train or load Tokenizers for encoder and decoder
if (not os.path.exists(encoder_tokenizer_pickle_path)) or (
    not os.path.exists(decoder_tokenizer_pickle_path)
):

    encoder_tokenizer = Tokenizer(filters="")
    encoder_tokenizer.fit_on_texts(data_train_texts["input"])

    decoder_tokenizer = Tokenizer(filters="")
    decoder_tokenizer.fit_on_texts("<sos> " + data_train_texts["target"] + " <eos>")

    with open(encoder_tokenizer_pickle_path, "wb") as handle:
        pickle.dump(encoder_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open(decoder_tokenizer_pickle_path, "wb",) as handle:
        pickle.dump(decoder_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

else:

    with open(encoder_tokenizer_pickle_path, "rb") as handle:
        encoder_tokenizer = pickle.load(handle)

    with open(decoder_tokenizer_pickle_path, "rb",) as handle:
        decoder_tokenizer = pickle.load(handle)

Word to Index and Index to Word Mapping

In [13]:
# word to index and index to word mapping for encoder
encoder_word2idx = encoder_tokenizer.word_index
encoder_idx2word = dict(map(reversed, encoder_word2idx.items()))

# word to index and index to word mapping for decoder
decoder_word2idx = decoder_tokenizer.word_index
decoder_idx2word = dict(map(reversed, decoder_word2idx.items()))

Total number of words in source and target dictionary

In [14]:
encoder_num_words = len(encoder_word2idx) + 1
logger.debug(f"The length of the encoder tokenizer dictionary is : {encoder_num_words}")

decoder_num_words = len(decoder_word2idx) + 1
logger.debug(f"The length of the encoder tokenizer dictionary is : {decoder_num_words}")
2020-07-13 02:49:32,401 - machine_translation - DEBUG - The length of the encoder tokenizer dictionary is : 28267
2020-07-13 02:49:32,402 - machine_translation - DEBUG - The length of the decoder tokenizer dictionary is : 14510

Calculate Maximum length in source and target sentence

To provide a fixed number of time steps to our encoder and decoder models, we pad the sequences of integers to a maximum length. Therefore, we calculate the maximum sentence length in both the source and the target language.

In [15]:
def max_length(sentences):
    # length (in tokens) of the longest sentence in the collection
    return max(len(word_tokenize(sentence)) for sentence in sentences)
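Padding itself is handled later with Keras' pad_sequences; here is a tiny illustration of post-padding with a hypothetical maxlen of 6 (the real maxlen values are computed in the next cell):
In [ ]:
from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[5, 27, 3], [8, 2]], maxlen=6, padding="post"))
# [[ 5 27  3  0  0  0]
#  [ 8  2  0  0  0  0]]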
In [16]:
encoder_max_length_pickle_path = os.path.join(
    ".", "data", "fra-eng", version_name, "french_max_len.pkl"
)
decoder_max_length_pickle_path = os.path.join(
    ".", "data", "fra-eng", version_name, "english_max_len.pkl"
)
if (not os.path.exists(encoder_max_length_pickle_path)) or (
    not os.path.exists(decoder_max_length_pickle_path)
):
    encoder_max_len = max_length(data_train_texts["input"])
    decoder_max_len = max_length(data_train_texts["decoder_input"])

    with open(encoder_max_length_pickle_path, "wb") as handle:
        pickle.dump(encoder_max_len, handle, protocol=pickle.HIGHEST_PROTOCOL)
        logger.debug(f"Encoder max length saved at {encoder_max_length_pickle_path}")

    with open(decoder_max_length_pickle_path, "wb") as handle:
        pickle.dump(decoder_max_len, handle, protocol=pickle.HIGHEST_PROTOCOL)
        logger.debug(f"Decoder max length saved at {decoder_max_length_pickle_path}")

else:

    with open(encoder_max_length_pickle_path, "rb") as handle:
        encoder_max_len = pickle.load(handle)

    with open(decoder_max_length_pickle_path, "rb") as handle:
        decoder_max_len = pickle.load(handle)

logger.debug(f"Maximum Encoder seq length is {encoder_max_len}")
logger.debug(f"Maximum decoder Seq length is {decoder_max_len}")
2020-07-13 02:50:03,421 - machine_translation - DEBUG - Encoder max length saved at ./data/fra-eng/version1/french_max_len.pkl
2020-07-13 02:50:03,422 - machine_translation - DEBUG - Decoder max length saved at ./data/fra-eng/version1/english_max_len.pkl
2020-07-13 02:50:03,423 - machine_translation - DEBUG - Maximum Encoder seq length is 55
2020-07-13 02:50:03,423 - machine_translation - DEBUG - Maximum decoder Seq length is 47

Data Generator

One-hot encoding the decoder targets for the entire corpus (roughly 142k sentences × 47 time steps × a vocabulary of about 14.5k words) would not fit in memory, so the model is fed with batches produced on the fly by Python generators.

In [20]:
def load_train_batches(data_train_texts, chunk_size=BATCH_SIZE):

    index_tracker = 0

    while True:

        # retrieve the batch of input texts
        input_train_text = list(data_train_texts["input"])[
            index_tracker : index_tracker + chunk_size
        ]
        # convert the extracted batch to sequences
        train_x = encoder_tokenizer.texts_to_sequences(input_train_text)
        # pad the input sequences to the max length
        train_x = pad_sequences(train_x, maxlen=encoder_max_len, padding="post")

        # retrieve the batch of decoder-input texts
        target_train_input = list(data_train_texts["decoder_input"])[
            index_tracker : index_tracker + chunk_size
        ]
        # convert the extracted batch to sequences
        train_y = decoder_tokenizer.texts_to_sequences(target_train_input)
        # pad the sequences to the max length
        train_y = pad_sequences(train_y, maxlen=decoder_max_len, padding="post")

        # retrieve the batch of decoder-input texts
        target_train = list(data_train_texts["decoder_target"])[
            index_tracker : index_tracker + chunk_size
        ]
        # convert the extracted batch to sequences
        train_z = decoder_tokenizer.texts_to_sequences(target_train)
        # pad the sequences to the max length
        train_z = pad_sequences(train_z, maxlen=decoder_max_len, padding="post")

        decoder_targets_one_hot = np.zeros(
            (train_x.shape[0], decoder_max_len, decoder_num_words), dtype="float32"
        )
        # assign the values
        for i, d in enumerate(train_z):
            for t, word in enumerate(d):
                decoder_targets_one_hot[i, t, word] = 1

        index_tracker += chunk_size
        if index_tracker >= len(data_train_texts):
            index_tracker = 0
            data_train_texts = shuffle(data_train_texts).reset_index(drop=True)

        yield [train_x, train_y], decoder_targets_one_hot


def load_valid_batches(data_valid_texts, chunk_size=BATCH_SIZE):
    valid_index_tracker = 0

    while True:

        # retrieve the batch of input texts
        input_valid_text = list(data_valid_texts["input"])[
            valid_index_tracker : valid_index_tracker + chunk_size
        ]
        # convert the extracted batch to sequences
        valid_x = encoder_tokenizer.texts_to_sequences(input_valid_text)
        # pad the input sequences to the max length
        valid_x = pad_sequences(valid_x, maxlen=encoder_max_len, padding="post")

        # retrieve the batch of decoder-input texts
        target_valid_input = list(data_valid_texts["decoder_input"])[
            valid_index_tracker : valid_index_tracker + chunk_size
        ]
        # convert the extracted batch to sequences
        valid_y = decoder_tokenizer.texts_to_sequences(target_valid_input)
        # pad the sequences to the max length
        valid_y = pad_sequences(valid_y, maxlen=decoder_max_len, padding="post")

        # retrieve the batch of decoder-input texts
        target_valid = list(data_valid_texts["decoder_target"])[
            valid_index_tracker : valid_index_tracker + chunk_size
        ]
        # convert the extracted batch to sequences
        valid_z = decoder_tokenizer.texts_to_sequences(target_valid)
        # pad the sequences to the max length
        valid_z = pad_sequences(valid_z, maxlen=decoder_max_len, padding="post")

        decoder_targets_one_hot_valid = np.zeros(
            (valid_x.shape[0], decoder_max_len, decoder_num_words), dtype="float32"
        )
        # assign the values
        for j, d1 in enumerate(valid_z):
            for k, word in enumerate(d1):
                decoder_targets_one_hot_valid[j, k, word] = 1

        valid_index_tracker += chunk_size

        if valid_index_tracker >= len(data_valid_texts):
            valid_index_tracker = 0
            data_valid_texts = shuffle(data_valid_texts).reset_index(drop=True)

        yield [valid_x, valid_y], decoder_targets_one_hot_valid
In [21]:
train_gen = load_train_batches(data_train_texts, chunk_size=BATCH_SIZE)
valid_gen = load_valid_batches(data_valid_texts, chunk_size=BATCH_SIZE)
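Before training, it can be worth pulling a single batch from a throwaway generator to confirm the shapes (a quick sanity check; the expected shapes follow from the constants defined above):
In [ ]:
# use a fresh generator so train_gen itself is not advanced
check_gen = load_train_batches(data_train_texts, chunk_size=BATCH_SIZE)
[check_x, check_y], check_targets = next(check_gen)
logger.debug(f"encoder input batch  : {check_x.shape}")        # (BATCH_SIZE, encoder_max_len)
logger.debug(f"decoder input batch  : {check_y.shape}")        # (BATCH_SIZE, decoder_max_len)
logger.debug(f"one-hot target batch : {check_targets.shape}")  # (BATCH_SIZE, decoder_max_len, decoder_num_words)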

4. Model Definition and Training

The recurrent units in the Encoder and Decoder can be vanilla RNN, GRU or LSTM cells. In this notebook we use LSTMs. The heart of the LSTM is its cell, which provides memory. The cell is controlled by three types of gates:

  • Input Gate : Feeds the new information that we’re going to store in the cell state.

  • Output Gate : Provides the activation to the final output of the lstm block at timestamp ‘t’.

  • Forget Gate : Decides what information to throw away from the cell state.

The equations of the LSTM gates are :

$i_t$ = $\sigma$($w_i$[$h_{t-1}$,$x_t$]+$b_i$)       $f_t$ = $\sigma$($w_f$[$h_{t-1}$,$x_t$]+$b_f$)       $o_t$ = $\sigma$($w_o$[$h_{t-1}$,$x_t$]+$b_o$)


where,

$i_t$ $\Rightarrow$ represents input gate                         $f_t$ $\Rightarrow$ represents forget gate

$o_t$ $\Rightarrow$ represents output gate                       $\sigma$ $\Rightarrow$ represents sigmoid function

$w_x$ $\Rightarrow$ weights for the respective gate (x)         $h_{t-1}$ $\Rightarrow$ output of the previous lstm block (at timestamp t-1)

$x_t$ $\Rightarrow$ input at current timestamp                     $b_x$ $\Rightarrow$ Biases for the respective gates(x)


The equations for the cell state, candidate cell state and the final output are :

$\tilde{c}_t$ = tanh($w_c$[$h_{t-1}$,$x_t$]+$b_c$)         $c_t$ = $f_t$ * $c_{t-1}$ + $i_t$ * $\tilde{c}_t$         $h_t$ = $o_t$ * tanh($c_t$)


where,

$c_t$ $\Rightarrow$ cell state (memory) at timestamp t               $\tilde{c}_t$ $\Rightarrow$ candidate for the cell state at timestamp t


Note : * represents element-wise multiplication of the vectors.

Lastly, the cell state is filtered and passed through the output activation, which decides what portion of it appears as the output of the current LSTM unit at timestamp t.

We can pass $h_t$, the output of the current LSTM block, through a softmax layer to get the predicted output ($y_t$) for the current block.
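To make these equations concrete, here is a minimal numpy sketch of a single LSTM step with toy dimensions and random weights (purely illustrative; this is not how the Keras LSTM layer used below is implemented internally):
In [ ]:
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenation [h_{t-1}, x_t] to the four gate pre-activations
    z = W @ np.concatenate([h_prev, x_t]) + b
    i, f, o, c_hat = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget and output gates
    c_t = f * c_prev + i * np.tanh(c_hat)         # new cell state
    h_t = o * np.tanh(c_t)                        # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
W = rng.normal(size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h_t, c_t = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, b)
print(h_t.shape, c_t.shape)  # (3,) (3,)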

In [30]:
# ENCODER
input_len = (encoder_max_len,)
encoder_input_layer = Input(shape=input_len)
# pass the input to the embedding layer
encoder_embedding_layer = Embedding(encoder_num_words, LATENT_DIM)
encoder_embedding = encoder_embedding_layer(encoder_input_layer)
# Create an LSTM encoder and pass the input embeddings to it
encoder_lstm = LSTM(LATENT_DIM, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
# save the hidden states of the encoder
encoder_states = [state_h, state_c]

# DECODER
target_len = (decoder_max_len,)
decoder_input_layer = Input(shape=target_len)
# pass the input to the embedding layer
decoder_embedding_layer = Embedding(decoder_num_words, LATENT_DIM)
decoder_embedding = decoder_embedding_layer(decoder_input_layer)
# Create an LSTM decoder and pass the target embeddings to it;
# the returned states are not used during training, but we will need them for inference.
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)


decoder_dense = Dense(decoder_num_words, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)
In [31]:
model = Model([encoder_input_layer, decoder_input_layer], decoder_outputs)
In [32]:
plot_model(model, show_shapes=True)
Out[32]:
In [23]:
# Compile the deep learning model
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
In [25]:
if os.path.exists("modelv1_loss0.32.h5"):
    model.load_weights("modelv1_loss0.32.h5")
else:
    
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
    machine_translate=model.fit_generator(train_gen,
                                          steps_per_epoch=int(np.ceil(data_train_texts.shape[0] / BATCH_SIZE)),
                                          validation_data=valid_gen,
                                          validation_steps=int(np.ceil(data_valid_texts.shape[0] / BATCH_SIZE)),
                                          callbacks=[es],
                                          epochs=EPOCHS,
                                          verbose=1)
    model.save('modelv1_loss0.32.h5')
    ## summarize history for accuracy 
    plt.plot(machine_translate.history['accuracy'])
    plt.plot(machine_translate.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'valid'], loc='upper right')
    plt.show()

    # summarize history for loss
    plt.plot(machine_translate.history['loss'])
    plt.plot(machine_translate.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'valid'], loc='upper right')
    plt.show()
/home/mumu/ml/img_cap/lib/python3.6/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Epoch 1/25
5671/5671 [==============================] - 5024s 886ms/step - loss: 0.7040 - accuracy: 0.8880 - val_loss: 0.7162 - val_accuracy: 0.8962
Epoch 2/25
5671/5671 [==============================] - 5048s 890ms/step - loss: 0.5715 - accuracy: 0.9057 - val_loss: 0.5742 - val_accuracy: 0.9131
Epoch 3/25
5671/5671 [==============================] - 4994s 881ms/step - loss: 0.5133 - accuracy: 0.9179 - val_loss: 0.7017 - val_accuracy: 0.9222
Epoch 4/25
5671/5671 [==============================] - 5009s 883ms/step - loss: 0.4719 - accuracy: 0.9262 - val_loss: 0.4218 - val_accuracy: 0.9279
Epoch 5/25
5671/5671 [==============================] - 4941s 871ms/step - loss: 0.4341 - accuracy: 0.9323 - val_loss: 0.4345 - val_accuracy: 0.9330
Epoch 6/25
5671/5671 [==============================] - 5863s 1s/step - loss: 0.4004 - accuracy: 0.9373 - val_loss: 0.3950 - val_accuracy: 0.9370
Epoch 7/25
5671/5671 [==============================] - 5643s 995ms/step - loss: 0.3683 - accuracy: 0.9419 - val_loss: 0.4604 - val_accuracy: 0.9403
Epoch 8/25
5671/5671 [==============================] - 5778s 1s/step - loss: 0.3413 - accuracy: 0.9457 - val_loss: 0.3231 - val_accuracy: 0.9427
Epoch 9/25
5671/5671 [==============================] - 4915s 867ms/step - loss: 0.3195 - accuracy: 0.9488 - val_loss: 0.3802 - val_accuracy: 0.9445
Epoch 10/25
5671/5671 [==============================] - 5338s 941ms/step - loss: 0.3011 - accuracy: 0.9513 - val_loss: 0.4292 - val_accuracy: 0.9460
Epoch 11/25
5671/5671 [==============================] - 5856s 1s/step - loss: 0.2835 - accuracy: 0.9539 - val_loss: 0.3248 - val_accuracy: 0.9468
Epoch 00011: early stopping

5. Inference

After the model is trained, it can be used to generate new sequences. To do this, we feed an "<sos>" token to the Decoder RNN along with the intermediate representation of the source sentence from the Encoder. The word generated at each step is then used as input at the subsequent time step, along with the updated hidden states. Generation continues until the "<eos>" token is produced or a maximum sequence length is reached.

In [24]:
encoder_model = Model(encoder_input_layer, encoder_states)

state_input_h = Input(shape=(LATENT_DIM,))
state_input_c = Input(shape=(LATENT_DIM,))
states_inputs = [state_input_h, state_input_c]

decoder_inputs_word = Input(shape=(1,))

decoder_inputs_word_x = decoder_embedding_layer(decoder_inputs_word)

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs_word_x, initial_state=states_inputs
)

decoder_states = [state_h, state_c]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs_word] + states_inputs, [decoder_outputs] + decoder_states
)
In [27]:
def decode_sequence1(test_sen):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(test_sen)

    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = decoder_word2idx['<sos>']

    eos = decoder_word2idx['<eos>']
    output_sentence = []

    for _ in range(decoder_max_len):
        
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])
        if eos == idx:
            break

        word = ''
        if idx > 0:
            word = decoder_idx2word[idx]
            output_sentence.append(word)

        # Update the decoder input
        # which is just the word just generated
        target_seq[0, 0] = idx

        # Update states
        states_value = [h, c]

    return (" ".join(output_sentence))
In [38]:
def evaluate_test_data(data_test_texts):

    predicted_results = pd.DataFrame(
        columns=["Input_Sentence", "Target_Sentence", "Predicted_Sentence"]
    )

    for seq_idx in range(len(data_test_texts)):
        # retrieve the batch of input texts
        input_test_text = list(data_test_texts["input"])[seq_idx : seq_idx + 1]
        target_test_text = list(data_test_texts["target"])[seq_idx : seq_idx + 1]
        # convert the extracted batch to sequences
        test_x = encoder_tokenizer.texts_to_sequences(input_test_text)
        # pad the input sequences to the max length
        test_x = pad_sequences(test_x, maxlen=encoder_max_len, padding="post")

        decoded_sentence = decode_sequence1(test_x)

        predicted_results.loc[seq_idx] = [
            data_test_texts.iloc[seq_idx]["input"],
            data_test_texts.iloc[seq_idx]["target"],
            decoded_sentence,
        ]
    return predicted_results

predicted_results=evaluate_test_data(data_test_texts)
In [36]:
display(predicted_results[:10])
   Unnamed: 0 | Input_Sentence | Target_Sentence | Predicted_Sentence | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4
0  0 | avezvous termine votre petitdejeuner | are you through with your breakfast | have you finished breakfast with breakfast | 0.571429 | 0.308607 | 0.270612 | 0.262691
1  1 | elles ont disparu | they disappeared | they are waiting for me | 0.148753 | 0.128824 | 0.123529 | 0.119884
2  2 | je veux savoir ce que vous voulez dire | i want to know what you mean | i want to know what you mean | 1.000000 | 1.000000 | 1.000000 | 1.000000
3  3 | jai eu ton age | i used to be your age | i had your age | 0.571429 | 0.436436 | 0.340163 | 0.312394
4  4 | je pense que vous vous etes tous rencontres | i think that youve all met | i think you all met | 0.714286 | 0.597614 | 0.418579 | 0.365555
5  5 | jai mal aux dents | i have toothache | i have a time | 0.537398 | 0.506664 | 0.453477 | 0.426052
6  6 | si tu pars maintenant tu vas certainement te r... | if you leave now im sure youll be caught in a ... | if you go now lets get to bed | 0.285714 | 0.209657 | 0.157060 | 0.135086
7  7 | je ne pense pas que ce soit suffisant | i dont think this is enough | i dont think thats enough | 0.714286 | 0.597614 | 0.526160 | 0.434721
8  8 | tu nas pas lair convaincu | you dont seem convinced | you dont sound | 0.600000 | 0.547723 | 0.467735 | 0.472871
9  9 | je ne regrette pas cette decision | i dont regret that decision | i dont regret this decision | 0.833333 | 0.707107 | 0.632878 | 0.537285

6. Model Evaluation

Now we have to provide a single numerical score to the translation generated which tells us how "good" our translation is when compared to the reference values.

Bilingual Evaluation Understudy - BLEU Score

BLEU is the standard machine translation (MT) evaluation metric. It compares the machine-generated translation to a set of good-quality reference translations and counts the number of matches in a weighted fashion. The BLEU score ranges between 0 and 1: the more matches, the closer our translation is to the reference translations and the closer the score is to 1. BLEU is calculated from n-gram precisions.

\begin{equation} P_n\ =\ \frac{\sum_{ngram}\ Count\_Clip(n\text{-}gram)}{\sum_{ngram}\ Count(n\text{-}gram)} \end{equation}
\begin{equation} Bleu\ Score\ =\ B_p\ \cdot\ \exp\left(\frac{1}{n}\ \sum_{i=1}^{n}\ \log P_i\right) \end{equation}

where $B_p$ is the Brevity Penalty which is calculated as,

$$ B_p = \begin{cases} 1 & \mbox{if translation output length} > \mbox{reference output length}\\ \exp\left(1 - \frac{\mbox{reference output length}}{\mbox{translation output length}}\right) & \mbox{otherwise}\\ \end{cases} $$
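As a small illustration before scoring the whole test set, NLTK's sentence_bleu can be applied directly to the example pair from the table above (reference "i think that youve all met" against the candidate "i think you all met"; the exact values depend on the smoothing method):
In [ ]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["i", "think", "that", "youve", "all", "met"]]
candidate = ["i", "think", "you", "all", "met"]
smooth = SmoothingFunction().method2

# BLEU-1 through BLEU-4 with the same weights used in the evaluation cell below
for n, weights in enumerate(
    [(1, 0, 0, 0), (0.5, 0.5, 0, 0), (0.33, 0.33, 0.33, 0), (0.25, 0.25, 0.25, 0.25)], start=1
):
    score = sentence_bleu(reference, candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")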
In [44]:
def calculate_sent_bleu(predicted_results):
    
    # initialize lists for tracking bleu scores
    bleu_1, bleu_2, bleu_3, bleu_4 = [], [], [], []
    for i in range(len(predicted_results)):
        hypothesis=predicted_results.iloc[i]['Target_Sentence'].split()
        list_of_ref=[predicted_results.iloc[i]['Predicted_Sentence'].split()]
        bleu_1.append(sentence_bleu(list_of_ref, hypothesis,weights=(1, 0, 0, 0),smoothing_function=SmoothingFunction().method2))
        bleu_2.append(sentence_bleu(list_of_ref, hypothesis,weights=(0.5, 0.5, 0, 0),smoothing_function=SmoothingFunction().method2))
        bleu_3.append(sentence_bleu(list_of_ref, hypothesis,weights=(0.33, 0.33, 0.33, 0),smoothing_function=SmoothingFunction().method2))
        bleu_4.append(sentence_bleu(list_of_ref, hypothesis,weights=(0.25, 0.25, 0.25, 0.25),smoothing_function=SmoothingFunction().method2))
        
    predicted_results['Bleu_1']=bleu_1
    predicted_results['Bleu_2']=bleu_2
    predicted_results['Bleu_3']=bleu_3
    predicted_results['Bleu_4']=bleu_4
    
    return predicted_results
In [45]:
result_try=calculate_sent_bleu(predicted_results)
In [22]:
print("BlEU-1 :  ", result_try["Bleu_1"].mean())
print("BlEU-2 :  ", result_try["Bleu_2"].mean())
print("BlEU-3 :  ", result_try["Bleu_3"].mean())
print("BlEU-4 :  ", result_try["Bleu_4"].mean())
BLEU-1 :   0.6149585050360737
BLEU-2 :   0.5410856670328017
BLEU-3 :   0.5000828719096638
BLEU-4 :   0.47073541733215074
In [48]:
# predicted_results.to_csv("./data/fra-eng/version1/modelv1_results.csv")
